Last updated: 2021-04-19

Before we start …

  • Download the workshop folder from GitHub via RStudio.
  • Open RStudio > File > New Project > Version Control > Git > paste URL > Create Project
  • https://github.com/jensroes/bayesian-mm-workshop.git
  • Exercise R scripts, data, slides
  • Create a folder called “stanout”

Outline

  • Fit a Bayesian mixed effects model (diving in the deep end)
  • Why Bayesian and difference to (classical) frequentist approaches
  • Convergence and model diagnostics
  • Evaluating the posterior
  • What are priors? (shouldn’t this be the first item?)
  • There will be exercises!!! (yay)

Fitting a model in brms

(Bürkner, 2017, 2018, 2019; Bürkner & Vuorre, 2019)

brms

  • R package for Bayesian models
  • (Almost) no more complicated to fit than lme4 models.
  • More probability models than other regression packages:
    • gaussian, lognormal, bernoulli, poisson, zero_inflated_poisson, skew_normal, shifted_lognormal, exgaussian, et cetera
    • and allows mixture models, nonlinear syntax, (un)equal variance signal-detection theory, multivariate models


  • Under the hood: brms generates Stan code and compiles a probabilistic MCMC (Markov chain Monte Carlo) sampler.
  • Compiling the sampler, and running it to obtain the full posterior, can take time.
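You can inspect the generated Stan code without fitting anything: brms exports make_stancode(), which takes the same formula, data and family arguments as brm(). A minimal sketch (any data frame with y, condition and participant_id columns will do):

```r
library(brms)

# Print the Stan program brm() would compile for this model;
# no sampling happens, so this returns immediately.
make_stancode(y ~ 1 + condition + (1 | participant_id),
              data = data, family = gaussian())
```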

Simulating data

data <- jens_data_machine(intercept = 300, slope = 15)
participant_id trial_id condition      y
             1        1         a 146.77
             1        2         b 351.40
             1        3         a 253.43
             1        4         b 276.41
             1        5         a 364.75
             1        6         b 438.49
           ...      ...       ...    ...
            30       19         a 322.52
            30       20         b 500.40

(600 rows in total: 30 participants × 20 trials, alternating conditions a and b)

Simulating data

data <- jens_data_machine(intercept = 300, slope = 15)

  • You will use the same “data” in the exercises.
  • Think about it as keystroke intervals :)
  • We want to recover the slope of 15 (the difference between conditions) from the data, after accounting for variability associated with the sampling process.
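jens_data_machine() is the workshop’s own helper, so everything below about its internals is an assumption; this is only a minimal sketch of a simulator with the same interface (normal by-participant intercepts, normal residuals, made-up SDs):

```r
# Hypothetical stand-in for the workshop's jens_data_machine();
# the real function may use different variance settings.
jens_data_machine <- function(intercept = 300, slope = 15,
                              n_ppt = 30, n_trials = 20,
                              sd_ppt = 25, sd_resid = 90) {
  d <- data.frame(
    participant_id = rep(seq_len(n_ppt), each = n_trials),
    trial_id       = rep(seq_len(n_trials), times = n_ppt),
    condition      = rep(c("a", "b"), length.out = n_ppt * n_trials)
  )
  ppt_intercepts <- rnorm(n_ppt, mean = 0, sd = sd_ppt)  # random intercepts
  d$y <- intercept +
         slope * (d$condition == "b") +                  # treatment contrast
         ppt_intercepts[d$participant_id] +
         rnorm(nrow(d), mean = 0, sd = sd_resid)         # residual noise
  d
}

data <- jens_data_machine(intercept = 300, slope = 15)
```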

Formulating the model

\[ y_i \sim N(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + \text{condition}_i\cdot\beta_1 + \text{participant}_i\\ \]

  • Outcome variable \(y\)
  • \(N\): normal distribution with a mean \(\mu\) and a residual (error) variance \(\sigma^2\)
  • Tilde symbol (\(\sim\)): “is distributed according to”
  • Equal sign (\(=\)): deterministic relationship
  • Subscript \(i\): every observation in \(1 \dots N\).
  • Greek symbols: unknown population parameter values (a hat indicates an estimate, e.g. \(\hat{\beta}\))

  • Treatment contrast: condition a = 0, condition b = 1
contrasts(factor(data$condition)) # default contrast
  b
a 0
b 1
  • Given the data \(y\) and this model, we can estimate \(\beta_0\) and \(\beta_1\).
  • For a Bayesian model we still need priors (end of session).
  • Intercept \(\beta_0\): estimated mean of condition a.
  • Slope \(\beta_1\): difference between condition a and b.
  • By-participant intercepts distributed as \(N(0,\sigma_u)\)

R syntax

  • Frequentist mixed-effects models
library(lme4)
fit_lmer <- lmer(y ~ 1 + condition + (1|participant_id), data = data)
  • Bayesian mixed-effects models
library(brms)
fit_brm <- brm(y ~ 1 + condition + (1|participant_id), data = data)

Exercise: fit models on simulated data

  • Complete and run scripts lmer_sim.R and brms_sim.R
  • Replace --- correctly; run the script carefully line by line.
  • Check the output in the console.
  • Inspect and compare the fixed effects summaries.

Fit models on simulated data

data <- jens_data_machine(intercept = 300, slope = 15)
fit_lmer <- lmer(y ~ 1 + condition + (1|participant_id), data = data)
fit_brm <- brm(y ~ 1 + condition + (1|participant_id), data = data)

Fit models on simulated data

data <- jens_data_machine(intercept = 300, slope = 15)
coef(summary(fit_lmer)) # Coefficients of frequentist model
             Estimate Std. Error   t value
(Intercept) 300.26567   5.789763 51.861484
conditionb   15.59151   8.136926  1.916143
fixef(fit_brm) # Coefficients of Bayesian model
            Estimate Est.Error        Q2.5     Q97.5
Intercept  300.16663  5.988561 288.8796244 311.79417
conditionb  15.76034  8.047246  -0.0991772  31.46148

Why go Bayesian?

see Kruschke (2014). Full disclosure: I’m biased!

Two universes

  • Two different ways of generalising from a sample to the population:
  • What do the data tell us about the population?
  • Null-hypothesis significance testing (NHST)
  • Bayesian

Null-hypothesis significance testing (NHST)

  • Classical / frequentist statistics; p-value based statistics
  • p-value: How plausible are the data (or something more extreme) if we assume that there’s no effect?
  • Evaluating the probability of data (e.g. t-value, F-statistic) assuming that the null hypothesis \(H_0\) is true.
  • If the data are extreme enough, \(H_0\) is concluded to be implausible (rejected).
  • If \(H_0\) is implausible, we assume \(H_1\) (alternative hypothesis).
  • Statistically significant means data are unexpected under \(H_0\).

\[ p(\text{data} \mid H_0 = \text{TRUE}) \]

NHST and its problems

  • Statistical significance is not binary but continuous.
  • How often are we actually interested in or seriously believe in \(H_0\)?
  • What are you doing if \(p=0.051\)?
  • Stopping rule
  • An \(\alpha\)-level of 0.05 is commonly taken to mean a 5% chance that:
    • our effect doesn’t replicate (isn’t real)
    • we will not observe an effect that is real
  • In reality, things are often even worse (Amrhein et al., 2019; Loken & Gelman, 2017)


  • Trade-off between statistical significance, effect size, and sample size:
    • Even small effects become significant when the sample is large enough.
    • But how plausible is a small effect (e.g. a mean difference, a correlation) for the question we are interested in?
    • What really matters is whether the size of the effect matches what we would expect.
  • Systematic misinterpretations of p-values (Colquhoun, 2017; Greenland et al., 2016) and CIs (Hoekstra et al., 2014; Morey et al., 2016).

Convergence failure for lmer models


outcome ~ condition + (condition|participant) + (condition|item)


  • The statistical model should be an appropriate representation of the process that generates our data.
  • Maximal random effects structure (Baayen et al., 2008; Barr et al., 2013)
  • Overparametrisation: unidentifiable parameter estimates (Bates et al., 2015)


  • Solutions: reducing model complexity, changing optimizers
  • Selecting a simpler model only because a more complex one failed to fit.
  • Some sources of random error are then left unaddressed.
  • Bayesian models converge by definition (as the number of iterations approaches \(\infty\)).

Bayes’ Theorem

  • Should I bring my umbrella?
  • I live in Nottingham: 122.1 rainy days per year
  • I live in Los Angeles: 35.0 rainy days per year
  • Forecast says 50% rain.
  • My window says it’s raining.
  • Bayesian inference means updating one’s belief about something as new information becomes available.
  • Bayesian inference quantifies uncertainty in beliefs about parameter values (e.g. a difference between means) and doesn’t rely on fixed cutoffs (e.g. p-values < 0.05).

Bayes’ Theorem

\[ p(\theta \mid y) \propto p(\theta) \cdot p(y \mid \theta) \]

  • \(\theta\) = parameter(s), \(y\) = data, \(\mid\) = given (conditional on), \(\propto\) = proportional to
  • \(p(y \mid \theta)\) = likelihood (probability of data given model parameter values)
  • \(p(\theta)\) = prior (what do we know about model parameter(s))
  • \(p(\theta \mid y)\) = posterior distribution (our belief in different parameter values after seeing the data)

Bayes’ Theorem

\[ p(\theta \mid y) \propto p(\theta) \cdot p(y \mid \theta) \]

  • Bayes’ Theorem allows us to determine probability / uncertainty distributions of parameter values.
  • Frequentist models use maximum likelihood estimation: we search for the parameter values that maximize the log-likelihood.
  • Bayesian estimation uses the likelihood to update existing beliefs in different parameter values.
  • How much we believe in different parameter values depends on the data and what we already know.
  • We update our beliefs as new information becomes available.
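The update can be made concrete with a toy grid approximation (this is not how brms works internally — brms samples with MCMC — but the arithmetic is the same idea; the data and prior here are made up):

```r
# Grid approximation of posterior ∝ prior × likelihood
# for the mean of five hypothetical observations.
y     <- c(290, 310, 305, 320, 295)       # made-up data
theta <- seq(200, 400, by = 0.5)          # candidate means

prior      <- dnorm(theta, mean = 300, sd = 50)
likelihood <- sapply(theta, function(m) prod(dnorm(y, mean = m, sd = 20)))

posterior <- prior * likelihood
posterior <- posterior / sum(posterior)   # normalise to a distribution

theta[which.max(posterior)]               # posterior mode, near the sample mean
```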

Why is Bayes important for linguists (and everyone else)?

It tells us what we want to know!


Given the data (and our prior knowledge).


  • Summary stats of a Bayesian model are often very similar to what people think frequentist quantities (p-values, CIs) mean (Nicenboim & Vasishth, 2016):
  • What’s the relative weight of evidence for one model (e.g., \(H_0\)) vs. another (e.g., \(H_1\))?
  • What interval contains an unknown parameter value with .95 probability?

Why is Bayes important for linguists (and everyone else)?

Instead of


\[ p(\text{data} \mid H_0 = \text{TRUE}) \]

we can


\[ p(H \mid \text{data}) \]

Why is Bayes important for linguists (and everyone else)?

  • What’s the probability of an effect given the data (and what we know)?
  • Useful for small data sets, where prior information plays a large role.
  • Probabilistic sampling can handle: convergence failures, missing-data problems, complex models (many parameters, many sources of random error), ceiling / floor effects, identifiability issues.
  • Results are not dependent on sampling plan (Kruschke, 2018).
  • Flexible data modeling is now easily available (brms, rstanarm, rethinking).

Why is Bayes important for linguists (and everyone else)?

  • Estimate credibility over parameter values based on the observed data (Kruschke, 2014).
  • Given the data, which parameter values should we believe in with most confidence?
  • For this we need to start with
  1. a probability model (e.g. a normal distribution)
  2. prior expectations (on e.g. means, difference, variability)
  • brms uses probabilistic sampling to determine the posterior uncertainty about the parameter values of the probability model, given the prior and the data.
  • This involves calculating the weighted probability of the data given candidate parameter values (which can take time).

Convergence and model diagnostics

see Lambert (2018) for HMC

So what about this?

  • Progress of probabilistic sampler.
  • 2,000 iterations for parameter estimation per chain
  • The first iterations / 2 samples of each chain are warm-up and are discarded.
  • 4 chains to establish convergence
  • Set cores (check parallel::detectCores()) to run chains in parallel.
  • How many posterior samples have we got for our inference (= chains * (iterations - warm-up))?
  • How many iterations / chains do you need?
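These settings map directly onto brm() arguments; the values below are the defaults described above (iter includes warm-up):

```r
library(brms)

fit_brm <- brm(y ~ 1 + condition + (1 | participant_id),
               data   = data,
               iter   = 2000,   # iterations per chain, including warm-up
               warmup = 1000,   # discarded warm-up iterations per chain
               chains = 4,      # independent chains to check convergence
               cores  = 4)      # run chains in parallel

# Posterior samples available for inference:
# chains * (iter - warmup) = 4 * (2000 - 1000) = 4000
```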

Parameter estimation

  • Hamiltonian Monte Carlo: a Markov chain Monte Carlo method for obtaining random samples.
  • The samples converge to being distributed according to a target probability distribution (the posterior).
  • The probability of each sample is evaluated given the data and the prior.
  • The sampler changes direction when the probability of a proposed estimate is lower than that of the previous one.
  • In practice, the parameter space is multi-dimensional.

Exercise: model diagnostics

  • Traceplots: we want fat hairy caterpillars
plot(fit_brm, pars = c("b_Intercept", "b_conditionb"))
  • \(\hat{R}\) convergence statistic; should be \(<1.1\) (Gelman & Rubin, 1992)
rhat(fit_brm, pars = c("b_Intercept", "b_conditionb")) 
  • Posterior predictive checks: compare observed data \(y\) and model predictions \(y_{rep}\)
pp_check(fit_brm, nsamples = 100)
  • Replace the ---s in scripts model_diagnostics.R and run the code (line by line).

Posterior probability distribution

see e.g. Nicenboim & Vasishth (2016) for a tutorial

Posterior probability distribution

  • Get posterior of slope \(\beta\) (difference between conditions)
library(dplyr) # for %>% and pull()
beta <- posterior_samples(fit_brm, pars = "b_conditionb") %>% pull()
length(beta)
[1] 4000
beta[1:5]
[1] 26.169134 20.234850  9.773276 21.438894 12.943398

Posterior probability distribution

  • Posterior mean
mean(beta)
[1] 15.76034

Posterior probability distribution

  • 95% probability interval (conceptually different from confidence interval)
quantile(beta, probs = c(0.025, 0.975))
      2.5%      97.5% 
-0.0991772 31.4614818 

  • 89% probability interval (McElreath, 2016)
quantile(beta, probs = c(0.055, 0.945))
     5.5%     94.5% 
 2.804782 28.344642 

Posterior probability distribution

  • Probability that slope \(\beta\) is negative: \(P(\hat{\beta}<0)\)
mean(beta < 0)
[1] 0.02775

  • Probability that slope \(\beta\) is smaller than 10: \(P(\hat{\beta}<10)\)
mean(beta < 10)
[1] 0.23375
  • More in the exercises: parameter_evaluation.R

What are priors?

chapter 7 in Lee & Wagenmakers (2014)

Priors

\[ \text{posterior} \propto \text{prior} \cdot \text{likelihood} \]

  • Prior knowledge about plausible parameter values.
  • This knowledge is expressed as probability distributions (e.g. normal distributions).
  • Priors help the probabilistic sampler by limiting the parameter space.
  • Small data samples are sensitive to prior information, which makes intuitive sense.
  • Otherwise, the data typically overwhelm the prior (automatic Ockham’s razor).
  • Less common: testing data against a (prior) effect suggested by the literature.
  • We know more than we sometimes think!

Priors: intercept

  • A priori, each value in the parameter space is equally possible.
  • Let’s think about the parameter space for models of keystroke data.

Priors: intercept

  • Intercepts are some form of average.
  • Can keystroke intervals range between -\(\infty\) and \(\infty\)?
  • What are the lower and upper end?

Priors: intercept

  • Intercepts are some form of average.
  • Can keystroke intervals range between -\(\infty\) and \(\infty\)?
  • What are the lower and upper end?
  • How long is the average pause before a sentence?

\[ \text{pre-sentence pause} \sim N(???, ???) \]

Priors: intercept

  • Intercepts are some form of average.
  • Can keystroke intervals range between -\(\infty\) and \(\infty\)?
  • What are the lower and upper end?
  • How long is the average pause before a sentence?

\[ \text{pre-sentence pause} \sim N(1000 \text{ msecs}, ???) \]

Priors: intercept

  • Intercepts are some form of average.
  • Can keystroke intervals range between -\(\infty\) and \(\infty\)?
  • What are the lower and upper end?
  • How long is the average pause before a sentence?
  • How short / long can this be?
  • Plausible probability distribution for pre-sentence pauses:

\[ \text{pre-sentence pause} \sim N(1000 \text{ msecs}, ???) \]

Priors: intercept

  • Intercepts are some form of average.
  • Can keystroke intervals range between -\(\infty\) and \(\infty\)?
  • What are the lower and upper end?
  • How long is the average pause before a sentence?
  • How short / long can this be?
  • Plausible probability distribution for pre-sentence pauses:

\[ \text{pre-sentence pause} \sim N(1000 \text{ msecs}, 500\text{ msecs}) \]

Priors: slope

  • Say, we compare people writing in L1 and L2.
  • Writing in L2 can be harder because of, e.g., lexical retrieval.
  • Hence, pre-word / pre-sentence keystroke intervals are sometimes longer.
  • What is a plausible prior for this delay?

Priors: slope

  • Say, we compare people writing in L1 and L2.
  • Writing in L2 can be harder because of, e.g., lexical retrieval.
  • Hence, pre-word / pre-sentence keystroke intervals are sometimes longer.
  • What is a plausible prior for this delay?

\[ \text{slowdown for L2s} \sim N(???, ???) \]

Priors: slope

  • Say, we compare people writing in L1 and L2.
  • Writing in L2 can be harder because of, e.g., lexical retrieval.
  • Hence, pre-word / pre-sentence keystroke intervals are sometimes longer.
  • What is a plausible prior for this delay?

\[ \text{slowdown for L2s} \sim N(250 \text{ msecs}, 100\text{ msecs}) \]

Priors: slope

  • We often don’t know what the effect could be.
  • However, we have an intuition about what’s plausible and what isn’t.
  • E.g., words with alternative spellings (accordian / accordion) may or may not lead to longer pauses than words with fewer possible spellings (asparagus).
  • We are not sure, so let’s use a mean of 0 msecs; what would be a possible SD?

Priors: slope

  • We often don’t know what the effect could be.
  • However, we have an intuition about what’s plausible and what isn’t.
  • E.g., words with alternative spellings (accordian / accordion) may or may not lead to longer pauses than words with fewer possible spellings (asparagus).
  • We are not sure, so let’s use a mean of 0 msecs; what would be a possible SD?

\[ \text{slope} \sim N(0 \text{ msecs}, ???) \]

Priors: slope

  • We often don’t know what the effect could be.
  • However, we have an intuition about what’s plausible and what isn’t.
  • E.g., words with alternative spellings (accordian / accordion) may or may not lead to longer pauses than words with fewer possible spellings (asparagus).
  • We are not sure, so let’s use a mean of 0 msecs; what would be a possible SD?

\[ \text{slope} \sim N(0 \text{ msecs}, 100\text{ msecs}) \]

  • We could even truncate the prior according to a lower / upper threshold.

Priors

  • Even weakly informative priors are helpful for estimating parameter values.
  • They help to constrain the parameter space.
  • The parameter space should reflect our beliefs.
  • This is important for more complex models (mixtures, Wiener diffusion models).

Priors

Check defaults used earlier:


fit_brm <- readRDS(file = "../stanout/brms_sim.rda")
prior_summary(fit_brm)
                     prior     class      coef          group
                    (flat)         b
                    (flat)         b conditionb
 student_t(3, 308.2, 98.7) Intercept
     student_t(3, 0, 98.7)        sd
     student_t(3, 0, 98.7)        sd           participant_id
     student_t(3, 0, 98.7)        sd Intercept participant_id
     student_t(3, 0, 98.7)     sigma

Student t-distribution

  • Symmetric continuous distribution with fat tails, assigning more probability to extreme values than the normal distribution.
  • \(\nu\) parameter controls the degrees of freedom
  • \(\nu = 1\) is a Cauchy distribution
  • When \(\nu \rightarrow \infty\) the t-distribution becomes Gaussian.
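Both special cases can be checked numerically with base R’s density functions (dt(), dcauchy(), dnorm()):

```r
x <- seq(-4, 4, by = 0.1)

# nu = 1: the Student t-distribution is the Cauchy distribution
all.equal(dt(x, df = 1), dcauchy(x))   # TRUE

# large nu: the t-distribution approaches the normal distribution
max(abs(dt(x, df = 1e6) - dnorm(x)))   # practically 0
```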

Model specification

Instead of


fit <- brm(outcome ~ predictor + (1|participant), data = data)


do …


# Create model
model <- bf(outcome ~ predictor + (1|participant))
# specifying model formula outside of brms works for lmer too.
# bf = brmsformula

# Fit brms
fit <- brm(model, data = data)

Prior specification

# Create model
model <- bf(outcome ~ predictor + (1|participant))

# Look at priors: some have reasonable defaults, others are flat.
get_prior(model, data = data)

# Specify priors
prior <- set_prior("normal(1000, 500)", class = "Intercept", lb = 0, ub = 100000) +
         set_prior("normal(0, 100)", class = "b") # for slope(s)

# Fit brms
fit <- brm(model, data = data, prior = prior)

Exercise: priors

  • Adding priors to last session’s model in brms_sim_with_prior.R; replace the ---s according to the comments in the script.
  • If you have time, make the standard deviation of the prior for slope \(b\) either much larger or smaller and check how this affects the posterior estimate.

Prior simulation

The end

For next session

  • We will compare different models of keystroke intervals (same data; 5 different probability models).
  • Run the following models and save their posterior:
  1. gaussian.R
  2. lognormal.R
  3. exgaussian.R
  4. mixturemodel.R
  5. skewnormal.R (spoiler: don’t run this one if you don’t have time)
  • We need their posteriors for the next session.
  • Don’t start this last minute :)

References

Amrhein, V., Trafimow, D., & Greenland, S. (2019). Inferential statistics as descriptive statistics: There is no replication crisis if we don’t expect replication. The American Statistician, 73(sup1), 262–270.

Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412.

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278.

Bates, D., Kliegl, R., Vasishth, S., & Baayen, H. (2015). Parsimonious mixed models. arXiv Preprint arXiv:1506.04967.

Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01

Bürkner, P.-C. (2018). Advanced Bayesian multilevel modeling with the R package brms. The R Journal, 10(1), 395–411. https://doi.org/10.32614/RJ-2018-017

Bürkner, P.-C. (2019). Bayesian item response modeling in R with brms and Stan. arXiv Preprint arXiv:1905.09501.

Bürkner, P.-C., & Vuorre, M. (2019). Ordinal regression models in psychology: A tutorial. Advances in Methods and Practices in Psychological Science, 2(1), 77–101.

Colquhoun, D. (2017). The reproducibility of research and the misinterpretation of p-values. Royal Society Open Science, 4(12), 171085.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472.

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, p values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350.

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157–1164.

Kruschke, J. K. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan.

Kruschke, J. K. (2018). Rejecting or accepting parameter values in Bayesian estimation. Advances in Methods and Practices in Psychological Science, 1(2), 270–280.

Lambert, B. (2018). A student’s guide to Bayesian statistics. Sage.

Lee, M. D., & Wagenmakers, E.-J. (2014). Bayesian cognitive modeling: A practical course. Cambridge University Press.

Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584–585.

McElreath, R. (2016). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.

Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1), 103–123.

Nicenboim, B., & Vasishth, S. (2016). Statistical methods for linguistic research: Foundational Ideas – Part II. Language and Linguistics Compass, 10(11), 591–613.